
test: add reproducer for case-insensitive write rejection (same field ID, different column casing)#562

Open
pandaamit91 wants to merge 3 commits into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-writes-repro

Conversation

Contributor

@pandaamit91 pandaamit91 commented Apr 27, 2026

Summary

We have noticed that OH writes with different column casing already succeed in some cases, and we want to validate the existing behavior before applying any fix. This PR adds characterization tests that document exactly which write paths work today and which do not, with an explanation of why.

Key Findings

  • df.writeTo().append() already works with the default caseSensitive=false setting. Spark's Iceberg integration maps DataFrame column names to table column names case-insensitively at analysis time. The data files are written using the stored
    column names, so the commit carries the unchanged existing schema — writeSchema.sameSchema(tableSchema) is true and the server's validateWriteSchema is never invoked. DaliSpark (a wrapper over df.writeTo()) gets this for free.
  • Explicit column-list SQL INSERT always fails in Spark 3.1 when casing differs, even with caseSensitive=false. Spark 3.1's ResolveInsertInto rule matches the INSERT column list case-sensitively regardless of the config setting, which only
    governs column references in SELECT expressions, not INSERT column lists. The unresolved column is silently dropped, causing AnalysisException: not enough data columns.
  • The server-side normalization fix is scoped to non-Spark clients — Trino DML, direct Iceberg Java API, and plain REST — that send a PATCH body with column names in a different casing than what's stored. Those clients don't go through Spark's
    case-insensitive resolution layer, so the server is the only place to normalize.
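The server-side normalization described in the last bullet amounts to rewriting incoming column names to the casing stored in the table schema. A minimal sketch of that idea, assuming stored schemas never contain two names differing only by case — `normalize_columns` is a hypothetical illustration, not the actual OpenHouse API:

```python
def normalize_columns(patch_columns, stored_columns):
    """Rewrite incoming column names to the stored schema's casing,
    matching case-insensitively. Hypothetical sketch of the server-side
    normalization; not the actual OpenHouse implementation."""
    # Assumes the stored schema has no names differing only by case.
    by_lower = {c.lower(): c for c in stored_columns}
    normalized = []
    for name in patch_columns:
        stored = by_lower.get(name.lower())
        if stored is None:
            raise ValueError(f"unknown column: {name}")
        normalized.append(stored)
    return normalized

# A Trino or plain-REST client sending lowercase names against a table
# that stores upper-case columns would be normalized before validation:
print(normalize_columns(["id", "name"], ["ID", "NAME"]))  # ['ID', 'NAME']
```

Clients that go through Spark's own case-insensitive resolution never need this path; it only matters for PATCH bodies arriving with mismatched casing.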

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually tested on local docker setup. Please include commands run, and their output.

  • Added new tests for the changes made.

  • Updated existing tests to reflect the changes made.

  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.

  • Some other form of testing like staging or soak time in production. Please explain.

    Test: Adds CaseInsensitiveWriteTest — a mock-based Spark e2e characterization test that establishes a baseline of which write paths already handle case-mismatched column names before any server-side fix is applied.

Screenshot 2026-04-27 at 2 33 44 PM

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

… from stored schema

Responds to the reviewer's observation that "writes with different casing already
succeed." These tests establish the baseline behavior before any fix is applied.

Three scenarios are documented:

1. testPositionalInsert_succeedsRegardlessOfStoredCasing
   Positional INSERT (no column list) never needs to resolve column names, so
   casing differences are irrelevant. Works unconditionally.

2. testExplicitColumnInsert_succeedsWithDefaultCaseSensitivity
   INSERT with an explicit lowercase column list (e.g. "id") against a table
   that stores "ID" succeeds with the Spark default (caseSensitive=false).
   Spark resolves "id" → "ID" at analysis time, so the server receives the
   correct casing. This confirms the reviewer's observation.

3. testExplicitColumnInsert_failsWhenCaseSensitiveEnabled
   The same explicit-column INSERT fails with an AnalysisException when
   spark.sql.caseSensitive=true. "id" cannot be resolved against "ID" on the
   client before the request ever reaches the server.

Together these tests show: writes already work under the default Spark
configuration, but fail once caseSensitive=true is in effect — a gap that
exists independently of any server-side schema normalization.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
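The three scenarios above all hinge on how column resolution behaves under spark.sql.caseSensitive. A toy model of that analysis-time resolution (hypothetical Python, not Spark internals) captures the difference:

```python
def resolve_column(name, table_columns, case_sensitive):
    """Toy model of Spark's analysis-time column resolution.
    With case_sensitive=False, "id" resolves against a stored "ID";
    with case_sensitive=True it does not. Hypothetical illustration,
    not actual Spark code."""
    for stored in table_columns:
        if stored == name:
            return stored
        if not case_sensitive and stored.lower() == name.lower():
            return stored
    return None  # unresolved: Spark raises AnalysisException here

# Scenario 2: default caseSensitive=false resolves "id" -> "ID"
print(resolve_column("id", ["ID"], case_sensitive=False))  # ID
# Scenario 3: caseSensitive=true fails before the request reaches the server
print(resolve_column("id", ["ID"], case_sensitive=True))   # None
```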
@pandaamit91 pandaamit91 force-pushed the ampanda/oh-case-insensitive-writes-repro branch from d634c9c to 13395e9 Compare April 27, 2026 21:04
pandaamit91 and others added 2 commits April 27, 2026 14:10
… append

Add testDataFrameWriteTo_failsWhenCaseSensitiveEnabled to complete the
characterization of existing write behavior. With caseSensitive=true,
Spark cannot resolve lowercase "test" or ALL-CAPS "TEST" against stored
"TeSt", so both writeTo().append() variants throw AnalysisException
before reaching the server — documenting the gap that exists regardless
of any server-side normalization fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… Spark 3.1

Spark 3.1's ResolveInsertInto rule matches INSERT column list names
case-sensitively regardless of spark.sql.caseSensitive. Rename
testExplicitColumnInsert_succeedsWithDefaultCaseSensitivity to
testExplicitColumnInsert_failsEvenWithDefaultCaseSensitivity and flip
its assertion to assertThrows, matching the actual observed behavior.
Update class-level Javadoc to reflect the corrected findings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
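The ResolveInsertInto behavior this commit describes (case-sensitive matching of the INSERT column list regardless of spark.sql.caseSensitive, with an unmatched name silently dropped) can be modeled in a short sketch; `match_insert_columns` is a hypothetical stand-in, not Spark's actual rule:

```python
def match_insert_columns(insert_columns, table_columns):
    """Toy model of Spark 3.1's ResolveInsertInto column-list matching:
    names are compared case-sensitively no matter what
    spark.sql.caseSensitive says, and an unmatched name is silently
    dropped. Hypothetical sketch, not Spark source."""
    matched = [c for c in insert_columns if c in table_columns]
    if len(matched) < len(table_columns):
        # Spark surfaces this as: AnalysisException: not enough data columns
        raise ValueError("not enough data columns")
    return matched

# INSERT INTO t (id) ... against a table storing "ID":
try:
    match_insert_columns(["id"], ["ID"])
except ValueError as e:
    print(e)  # not enough data columns
```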
@cbb330
Collaborator

cbb330 commented Apr 28, 2026

Thanks @pandaamit91, the current state of the world is clear now.

What is the action item for the Spark client, given that the mixed-case write will throw an error before it lands on the OH server?
